Abstract. Crowd computing empowers computer systems by utilizing human perception
and the human ability to solve non-algorithmic problems. In this approach, a group
of humans is asked to collaboratively solve a problem that cannot be solved easily by
individuals, or perfectly by computers. However, using humans to solve problems
introduces complexities: the lack of generative models, complex cost models, lower
speed in comparison to computers, limited knowledge and skills, noise, bias, and
error. An optimized crowd computing system should overcome these complexities and
improve the quality of solutions.

This paper answers three main questions: What is crowd computing? Why should one
use crowd computing? And how should one use crowd computing? We briefly answer the
first two questions and focus on the third, especially on solving classification
problems using the multiple checking scenario. In addition, we compare current
crowd computing methods and provide guidelines for future work based on the open
issues in this field.

Keywords: Crowd computing, Crowdsourcing, Mechanical Turk, User modeling, Sample
selection, Label estimation.



1. Introduction 



There are quite compelling arguments against current computer systems, because 
many applications (especially in the domain of artificial intelligence) suffer from 
current system imperfections. For instance, computers are not completely capa- 
ble of understanding image contents; hence, they cannot search within images 
efficiently with respect to their contents. In order to dissect these contents, they 
can only process the syntactical features of the images such as color, texture, 
and shape. Clearly, representing the content by using such low-level features can 
be erroneous. This problem is similar to estimating the level of happiness of a 
human in an image by only considering its color histogram! 

Such complications have motivated the search for alternatives to classic AI
systems. One of the best proposed solutions is "crowd computing". Crowd computing,
or problem solving using crowdsourcing, utilizes human perception (understanding
and feeling) and intellectual abilities to solve non-algorithmic problems.

There are many complexities in using crowd computing. Human problem solving
processes cannot be completely modeled, and there are no generative models to
predict the answers to future problems. In addition, each solution provided by a
human incurs some cost (or bonus) that must be paid. Moreover, participants are
usually not all experts, and their knowledge is limited. Furthermore, human
decisions may be noisy and erroneous. Finally, humans have lower computing speeds
than current computer systems.

A crowd computing system must overcome the aforementioned complexities. In recent
years, a number of general frameworks have been proposed for this purpose. We call
them "crowd computing scenarios" and introduce them in the next section. In
practice, expressing problems in terms of those scenarios is not straightforward
and varies across problems and applications. In addition, in most cases the
expression generates a large number of small problems that are costly to solve.

In this paper, we survey the concept of crowd computing, its challenges and
complexities, and the ways to overcome its problems. The rest of the paper is
organized as follows. Section 2 discusses crowd computing concepts, scenarios, and
system design. Section 3 presents the arguments for why crowd computing is
required. Section 4 covers the technical aspects of crowd computing and its
scenarios, including methods from the literature for sample selection, user
modeling, and integration of user-provided solutions, as well as their analysis
and comparison. Concluding remarks and open issues are presented in the last
section.

2. What is crowd computing? 

Crowd computing refers to problem solving using crowdsourcing. The term
crowdsourcing, which semantically combines "wisdom of crowds" and "outsourcing",
was coined by Jeff Howe in 2006 in an article in Wired Magazine (Howe, 2006). He
also published a book with the same title in 2008 (Howe, 2008). In crowdsourcing,
a group of people is asked to collaboratively do a task that cannot easily be done
by a single individual. For example, Wikipedia is one of the most recognized
crowdsourcing systems: thousands of Internet users participate in creating the
world's largest encyclopedia.

Some economics and management researchers believe that crowdsourcing can change
the future of business (Malone, 2004; Howe, 2008). The most remarkable
applications of crowdsourcing are: creation (e.g. Wikipedia or open-source
software), standby human resources (e.g. "Rent a Coder", an active network in the
field of software design and development), R&D (e.g. InnoCentive), crowdfunding
(e.g. the Kickstarter network for funding creative projects), forecasting (e.g.
the Threadless network, which estimates the success rate of T-shirt designs in the
market), organization (e.g. the Digg network for organizing Internet links), and
crowd computing or collective intelligence (e.g. the Amazon Mechanical Turk
network as a marketplace for providing solutions to micro tasks).

In this section, we introduce crowd computing. First, we illustrate crowd
computing with an example. Then we present the crowd computing scenarios. Next, we
describe the properties of problems and applications that are suitable for crowd
computing. Then, to set realistic expectations of crowd computing, we report on
the performance of some crowd computing systems. Finally, we describe three steps
that are required to design a crowd computing system.

Fig. 1. Sample CAPTCHA tests (left), and a sample reCAPTCHA test (right).

2.1. Illustration 

Correcting the errors of classical Artificial Intelligence (AI) systems is one
application of crowd computing. An example of such a system is Optical Character
Recognition (OCR). Although current OCR systems perform acceptably on high-quality
scanned texts, they perform poorly on low-quality scans or old documents with
faded ink. Crowd computing can help OCR systems recognize these documents more
accurately.

ReCaptcha is an example of a crowd computing system that helps OCR systems
recognize low-quality documents while protecting websites from bots attempting to
access restricted areas (von Ahn et al, 2008). To analyze scanned documents,
reCaptcha uses two different OCR systems. Their outputs are aligned and compared
with each other, and then checked against a reference dictionary. Any word that
the two OCR programs decipher differently, or that is not in the dictionary, is
marked as suspicious. The images of the suspicious words are the sub-problems that
the system must solve.
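
The suspicious-word rule described above can be sketched in a few lines of Python.
This is only an illustration under simplifying assumptions: the two OCR outputs are
assumed to be already aligned word by word, and the function name, the sample words,
and the tiny dictionary are hypothetical and not taken from the reCaptcha system.

```python
def find_suspicious(ocr_a, ocr_b, dictionary):
    """Return indices of words that the two OCR outputs disagree on,
    or that are missing from the reference dictionary.
    Assumes ocr_a and ocr_b are already aligned word by word."""
    suspicious = []
    for i, (wa, wb) in enumerate(zip(ocr_a, ocr_b)):
        if wa != wb or wa.lower() not in dictionary:
            suspicious.append(i)
    return suspicious


# Hypothetical outputs of two OCR engines for the same four word images.
ocr_a = ["the", "quick", "brovvn", "fox"]
ocr_b = ["the", "qu1ck", "brovvn", "fox"]
dictionary = {"the", "quick", "brown", "fox"}

suspicious = find_suspicious(ocr_a, ocr_b, dictionary)
print(suspicious)  # [1, 2]: one disagreement, one out-of-dictionary word
```

Only the word images at the returned indices would be sent to humans; the remaining
words are treated as machine solvable.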

ReCaptcha uses CAPTCHA¹ tests. A common type of CAPTCHA test shows an image of
distorted letters or digits and asks the user to type the text shown in that
image. Humans can recognize the distorted text, but computers cannot (von Ahn et
al, 2003). A sample CAPTCHA is illustrated in Fig. 1 (left).

In contrast to a CAPTCHA test, a reCaptcha test shows two words in an image: a
distorted word whose digitized text is known, along with a word that the OCR
system is unable to recognize. The first word is used for user validation, while
the second is used to help the OCR systems recognize "suspicious" words. Users are
asked to type both words correctly before being allowed through. A sample
reCaptcha image is shown in Fig. 1 (right).

As users do not know which word is the control word, reCaptcha assumes that if
they type the control word correctly, their answer for the questionable word is
also correct. In addition, to overcome human error and fraud, it uses a multiple
checking mechanism. Words that are consistently given a single identity by humans
are recycled as control words (von Ahn et al, 2008).
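
A minimal sketch of this validation logic follows. The function names and the
thresholds `min_votes` and `min_share` are illustrative assumptions; the actual
agreement rule used by reCaptcha is not specified here.

```python
from collections import Counter


def accept_answer(typed_control, typed_unknown, control_truth):
    """Trust the transcription of the unknown word only if the user
    typed the control word correctly; otherwise discard it."""
    if typed_control.strip().lower() == control_truth.lower():
        return typed_unknown.strip().lower()
    return None


def consensus(votes, min_votes=3, min_share=0.75):
    """Return the agreed transcription, or None if agreement is too weak.
    min_votes and min_share are hypothetical thresholds."""
    if len(votes) < min_votes:
        return None
    word, count = Counter(votes).most_common(1)[0]
    return word if count / len(votes) >= min_share else None


# Each pair is (typed control word, typed unknown word) from one user.
submissions = [("morning", "upon"), ("morning", "upon"),
               ("mourning", "open"), ("morning", "upon"),
               ("morning", "upon")]
votes = [v for c, u in submissions
         if (v := accept_answer(c, u, "morning")) is not None]
agreed = consensus(votes)
print(agreed)  # upon
```

A word for which `consensus` returns a transcription could then be recycled as a
control word, as described above.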

Tests on a dataset containing all 24080 word images of 50 randomly chosen scanned
articles from five different years (1860, 1865, 1908, 1935, and 1970) of the New
York Times archive show significant results for reCaptcha. ReCaptcha achieved an
accuracy of 99.1%, whereas a standard OCR achieved 83.5%. In this example, 6260
words were marked as suspicious, of which only about 4% were recognized by both
OCR programs, while 95.94% were recognized by reCaptcha. In the first year after
launch, more than 40,000 websites deployed reCaptcha, and over 440 million words
were transcribed by reCaptcha (von Ahn et al, 2008).

1 Completely Automated Public Turing test to tell Computers and Humans Apart

J. Muhammadi and H. R. Rabiee

ReCaptcha does not pay any money to websites or users; rather, it provides an
authentication service to them. However, not all crowd computing applications have
such an opportunity; instead, they use marketplaces that charge a fee to provide
human answers to their micro problems. Amazon Mechanical Turk (MTurk) is a
well-known crowd computing marketplace. MTurk has more than one hundred thousand
members and hundreds of thousands of research and business tasks. MTurk provides
developers with APIs that enable them to connect their systems directly to it. The
main reasons for MTurk's popularity are its large number of members; the high
diversity of members' knowledge, skills, locations, cultures, and socio-economic
status; low-cost labor; and a fast cycle of theory and test (Mason et al, 2010).

2.2. Scenarios 

In a crowd computing system, each problem or application (e.g. recognizing the
words of a document) is divided into several sub-problems (e.g. single word
images). Some sub-problems are machine solvable (e.g. images that are recognized
by both OCR systems in reCaptcha), while others are not and need human
intelligence (e.g. suspicious word images). To remove noise, bias, and error, the
solutions provided by humans should be validated (e.g. by a multiple checking
mechanism). Finally, the provided solutions must be integrated to extract the
solution of the sub-problem (e.g. the text of the single word image).

ReCaptcha uses website users to solve the hard problems, and a multiple checking
mechanism to overcome the noise due to human mistakes, bias, and error. In a
general multiple checking scenario, solutions to each sub-problem are requested
from a number of humans, and each human is paid a small amount of money for each
solution. In addition, there is a mechanism for integrating the provided
solutions. The simplest integration mechanism is majority voting.

Multiple checking is not the only scenario in crowd computing. Games With A
Purpose (GWAP) and iterative tasks are also popular scenarios in crowdsourcing.

In general, the designers of GWAPs try to embed their problems into a game. The
first GWAP, named ESP, is an online game for image annotation originally conceived
by Luis von Ahn (von Ahn et al, 2004). In this game, two online users are paired
randomly by the system. The paired players do not know each other and cannot
communicate. An image is displayed to the players, and within a specified period
of time the players independently guess the image content by entering text tags
(they do not see each other's tags). The players win the game only if one of them
enters a tag that the other player has already entered, subject to the constraint
that they may not use words in a taboo list. The taboo list is provided by the
system to exclude obvious tags and tags obtained for that image in previous games.
The resulting tag is then used as a new annotation for that image.
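
The matching rule of one game round can be sketched as follows. The representation
of a player's guesses as (timestamp, tag) pairs and the function name `esp_round`
are assumptions made for illustration; they are not part of the original game's
implementation.

```python
def esp_round(tags_a, tags_b, taboo):
    """Return the first tag entered by one player that the other player
    has already entered, skipping taboo words; None if there is no match.
    Each input is a list of (timestamp, tag) pairs."""
    taboo = {t.lower() for t in taboo}
    # Merge both players' guesses into one timeline.
    events = sorted([(t, "a", tag.lower()) for t, tag in tags_a] +
                    [(t, "b", tag.lower()) for t, tag in tags_b])
    seen = {"a": set(), "b": set()}
    for _, player, tag in events:
        if tag in taboo:
            continue  # taboo words can never become the agreed tag
        other = "b" if player == "a" else "a"
        if tag in seen[other]:
            return tag
        seen[player].add(tag)
    return None


player_a = [(1.0, "dog"), (2.5, "grass"), (4.0, "ball")]
player_b = [(1.2, "puppy"), (3.0, "ball"), (3.5, "grass")]
agreed_tag = esp_round(player_a, player_b, taboo={"dog"})
print(agreed_tag)  # grass
```

The agreed tag would then be stored as an annotation and, in later games, could be
added to the taboo list for the same image.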



Crowd computing: a survey 






According to the authors, the ESP game could provide one tag for every image
indexed by Google in only one month if it became a highly ranked game on online
game websites (based on statistics from the year the paper was published) (von Ahn
et al, 2004).

"Iterative" or "collaborative" tasks are another scenario for crowd computing. In
this scenario, users build on or evaluate each other's answers (Potter et al,
2010). Here, in contrast to parallel tasks, users have access to the other users'
answers. Since this might bias the user's mind, the scenario is not suitable for
tasks like voting or brainstorming (Little et al, 2010). In addition, the parallel
and iterative paradigms can be used together. For example, in a text improvement
problem, each passage is improved by one user, a second user improves the first
user's work, and a third user selects the better of the two versions. Another user
then improves the winning version, and the cycle continues until a stopping
criterion is reached (Little et al, 2009).
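
The improve-and-select cycle can be sketched as a simple loop. Here the workers
and the judge are stand-in Python callables rather than real crowd members, and a
fixed number of rounds is assumed as the stopping criterion; both are assumptions
for illustration only.

```python
def iterate_improve(text, workers, judge, max_rounds=3):
    """Alternate improvement and selection steps.
    workers: callables standing in for users who edit a text.
    judge: callable standing in for a user who keeps the better of two."""
    best = workers[0](text)  # the first user improves the passage
    for r in range(1, max_rounds + 1):
        candidate = workers[r % len(workers)](best)  # next user builds on it
        best = judge(best, candidate)                # keep the better version
    return best


# Simulated workers: two fix a typo each, one makes the text worse.
workers = [lambda s: s.replace("teh", "the"),
           lambda s: s.replace("recieve", "receive"),
           lambda s: s + "!!!"]


def judge(a, b):
    """Prefer the version containing fewer known defects (toy criterion)."""
    bad = ("teh", "recieve", "!!!")
    score = lambda s: sum(s.count(w) for w in bad)
    return b if score(b) < score(a) else a


result = iterate_improve("teh users recieve teh task", workers, judge)
print(result)  # the users receive the task
```

The selection step is what prevents a poor edit (the third simulated worker) from
propagating to later rounds.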

2.3. Suitable problems and applications 

A suitable application or problem to be solved by crowd computing should have 
the following features: 

— It should be divisible into several sub-problems. These sub-problems should 
be almost independent (They could be solved in parallel). They should also 
be static in time, in order to keep the validity of the integrated solution of the 
original problem. 

— A large number of sub-problems should be non-solvable by machines, and could 
be solved by a regular human. The solution of these sub-problems should be 
independent of the users, and verifiable by other users. 

— Solving the main problem using a small group of expert people should be 
costly. 

— There should be a feasible method to divide the original problem into sub- 
problems, and to integrate the sub-solutions. 

For example, consider the Content Based Image Retrieval (CBIR) application using
image annotations. On the one hand, annotating images based on their content is a
very hard problem for machines and must be done by humans for millions of images.
On the other hand, collecting and organizing the images and their corresponding
tags, as well as searching within them, can only be done efficiently by computers.
It is therefore rational to use crowd computing for this application.

A number of other applications are suitable for crowd computing, for example: text
improvement (Bernstein et al, 2010), evaluation of music similarity (Urbano et al,
2010), measuring the relevance between results and keywords in search engines
(information retrieval) (Carvalho et al, 2011; Grady et al, 2010), text
translation (Corney et al, 2010), evaluation of common sense knowledge (Gordon et
al, 2010), affect recognition in text, image, and video (Snow et al, 2008),
building training and evaluation datasets for classic machine learning algorithms
(Bloodgood et al, 2010), and error detection in classic AI systems such as OCR
systems (von Ahn et al, 2008).

From another point of view, two categories of problems are suitable for crowd
computing: problems that need human consciousness, and problems that need human
common sense knowledge. We survey these two categories in Section 3.

2.4. Performance 

The reported results of implemented crowd computing systems show the power of this
approach in real practical applications. Consider the results reported in
(Bernstein et al, 2010), where crowd computing is used for typo and grammatical
error correction, summarization, and overall improvement of English texts written
by non-native writers. The authors used MTurk as their human network. The reported
results show that the output texts were 10 to 22% shorter than the input texts in
the summarization tests, and contained 67% fewer errors in the text correction
tests. The cost of such a task was $1.41 per paragraph. Although it takes a long
time for MTurk users to pick up the sub-problems (the waiting time), they solve
them quickly. An increase in the number of marketplaces in the future, or the
creation of proprietary human networks for applications, would decrease the
waiting time. Disregarding the waiting time, the large number of users and the
possibility of doing the tasks in parallel reduce the overall time for improving
any text to less than five minutes.

In (Snow et al, 2008), MTurk users are compared to experts with respect to the
accuracy of their answers. Five categories of natural language processing tasks
are used for this purpose: affect recognition in text, word semantic similarity
measurement, recognizing textual entailment, event temporal ordering, and word
sense disambiguation. Each problem in each category is solved by one expert and by
multiple MTurk users. The MTurk answers are integrated using majority voting. The
experimental results were significant. In almost all experiments, the quality of
the majority vote was at least as good as that of the individual expert answers.
For example, in the affect recognition tasks, the integrated solutions had better
quality than the expert ones for five of the six emotions (all except fear). The
number of users required to obtain such results was between 2 and 9 for the
different categories. In the word semantic similarity tasks, the correlation
achieved by integrating the answers of 10 users was the same as that of the expert
answer. Also, in all tasks, the quality of the integrated solutions was directly
related to the number of participating users (the more answers, the higher the
accuracy). The reported throughput and cost are 840 tasks per hour and 151 tasks
per dollar.
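
The observed relation between the number of users and the accuracy of the majority
vote can be illustrated with an idealized model. The sketch below assumes that
labelers answer a binary question independently and that each is correct with the
same probability p; this assumption is ours and is not the experimental setup of
(Snow et al, 2008).

```python
from math import comb


def majority_accuracy(n, p):
    """Probability that the majority of n independent labelers is correct,
    for odd n, when each labeler is correct with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))


# Accuracy of the integrated answer for a growing number of labelers.
for n in (1, 3, 5, 9):
    print(n, round(majority_accuracy(n, 0.7), 3))
```

Under this model, adding labelers helps only when p is above 0.5, and each extra
label also increases the cost, which is the trade-off formalized in Section 4.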

2.5. System design 

There are three steps in designing a typical crowd computing system: defining the
system's grand strategies, generating the sub-problems, and designing and
optimizing the processes. We explain each of these steps below.

Defining the system grand strategies: In designing a crowd computing system, we
must first define the system's grand strategies. Examples of the questions to be
answered are: Is the system active or passive²? Which scenario will be used to
overcome human errors, bias, and noise? Which network or human work marketplace
will be used? Do users compete in proposing solutions or not?

2 In an active system, the problems are assigned to the users by the system, while
in a passive system all problems are placed into a pool and the users select their
desired problems from it. Since users are not present at all times, do not like to
answer assigned questions (especially at specified times), and active systems are
more complex than passive ones, most current systems are passive.

Different strategies have different effects on the whole system. For example, in a
non-competitive system, the amount of reward does not affect the quality of
outputs but does affect the waiting time (Bernstein et al, 2010), while in a
competitive network a change in rewards directly affects the quality of results
(Yang et al, 2008). Another example is the type of task assignment in the system.
In active systems, each user can be modeled by his or her history. These models
can be used to identify the user best suited to a given task. In passive systems,
there is no explicit task assignment. Instead, the problems are designed to
maximize the probability of being selected by some group of users. This requires
knowing the criteria that users apply when selecting problems from the pool.
Examples of these criteria are the time at which the problem was added to the
pool, the problem's expiration time, and the reward assigned to the problem. The
best values for these criteria can be estimated using user behavior assessments.

Generating the application's sub-problems: A target application or problem should
be divided into several sub-problems, so that each sub-problem can be solved in a
short time by a user with ordinary skills and knowledge. The division of the main
problem into sub-problems follows the system's main scenario. In GWAPs, each
sub-problem is a game level, while in a multiple checking scenario each
sub-problem is a simple classification problem.

Crowd computing can also be used to generate sub-problems. For example, consider
the text improvement application (correcting typos and grammatical errors), and
suppose that we want to use crowd computing with the multiple checking scenario.
In (Bernstein et al, 2010), each paragraph is first submitted to the system as a
simple binary classification question, and the users are asked whether that
paragraph requires any correction. Then, each sentence of each paragraph that
requires correction is submitted to the system as a simple binary classification
question, and the users are asked whether that sentence requires any correction.
This step identifies the sentences that require correction. In the next step, each
such sentence is given to multiple users to revise. Then, the set of revised
sentences for each target sentence is submitted to the system as a multi-class
classification problem, and the users are asked to select the best one. Finally,
the original sentences are replaced by the results of the last step.

Several factors are important in generating sub-problems. For example: How should
problems be divided into sub-problems? What are the types of the sub-problems?
When should each sub-problem be added to the pool? When does each sub-problem
expire? How many answers are required for each sub-problem? And how much reward
will be paid to the users? The answers to these questions determine the level of
success of a crowd computing system.

A good approach for choosing the proper value of a factor in designing the
sub-problems, or for assessing its role in the overall performance, is "user
behavior assessment". For example, to assess the role of the reward for a set of
sub-problems, we can submit several problems of that set with different reward
amounts. Then, we measure the performance of our system (e.g., the time spent and
the quality of answers). Analyzing such parameters helps us to assess the effect
of the reward factor. In (Bernstein et al, 2010), similar experiments are carried
out with MTurk marketplace users. The results show that decreasing the reward
amounts does not affect the quality of the solutions, but it increases the waiting
time (because users are more interested in problems with higher rewards).

Designing and optimizing the processes: Differing levels of user skill, the need
to overcome user error and bias, and user costs make crowd computing very
complicated. Hence, processes must be defined to ensure that the problems are
solved with the highest quality within a specified budget. User modeling, sample
selection, and label integration are the three main components in optimizing crowd
computing systems. We survey these methods in the next sections.



3. Why crowd computing? 

On the one hand, computers are very fast and accurate, but they cannot understand
the world around them as well as humans do. On the other hand, humans understand
the world, but they cannot process information as quickly and accurately as
computers. For example, consider a classification problem. A human perceives and
classifies the samples (or patterns) in the original space (pattern space). A
computer, in contrast, does not understand the pattern space. Each sample in the
pattern space must be transferred to a feature space by means of sensors. Also,
the classification procedure must be dictated to the computer by humans.

In contrast to humans, computers classify the feature vectors according to the
dictated procedures. Although the execution of these procedures is very fast and
accurate, we cannot use computers to solve all problems. There are two main
reasons for this: 1) in transforming the original space to the feature space, a
large amount of information is lost, and 2) current computers are restricted to
algorithmic methods, which cannot reflect the complex, and probably
non-algorithmic, methods of the human mind.

The goal of crowd computing is combining human perception, and brain 
power in solving non machine-solvable problems, together with computers' ac- 
curacy and speed, to create systems which have never existed before. 

From a cognitive point of view, two categories of problems are proper candi- 
dates for solving by crowd computing: problems that need human consciousness, 
and problems which require common sense knowledge. We present these two 
concepts, in the rest of this section. 

3.1. Consciousness 

In (Turing, 1936), Turing showed that any function that is calculable by means of
an "effective"³ procedure can be calculated by means of a formal method (such as a
Turing machine). Independently and at the same time, Church showed the same result
through a totally different method using λ-calculus (Church, 1936). Therefore, the
thesis that asserts the equivalence of formal methods and effective procedures is
known as the Church-Turing thesis.

3 A procedure for carrying out an activity is considered effective if it can be
set out in terms of a finite number of exact instructions; can be carried out
without any errors; produces the desired output in a finite number of steps; can
be carried out, in practice or in principle, by a human being unaided by any
machines or tools except paper and pencil; and demands no insight or ingenuity of
the human being doing the job (Copeland, 2008).

Despite the universal consensus about the Church-Turing thesis, and although some
researchers have focused on providing a proof of it (e.g. (Dershowitz et al, 2008;
Boker et al, 2008; Boker et al, 2010)), no proof has been established yet.

The Church-Turing thesis is valid in the digital world. Disregarding the analog
computers that may be developed in the future (such as quantum computers), and
restricting the so-called "current computers" to digital machines, we can claim
that current computers are limited to effective procedures.

Assuming Rosenthal's well-known Higher Order Theories (HOT) of consciousness
(Rosenthal, 1997), neurobiological creatures, unlike current computers, can
perform non-algorithmic activities. In those creatures, non-inferential and
non-observational beliefs about a mental state can be formed by other mental
states, which is called consciousness.

What is the role of consciousness in problem solving? Consciousness plays an
important role in human feeling, perception, and in general in understanding the
world. Having consciousness, humans have access to the original information, while
a computer's information is a reflection of human information that has been
defined for it algorithmically.

There have been some unsuccessful attempts to create artificial consciousness. For
example, connectionists believe that the artificial creation of consciousness is
possible if the number of artificial neurons and connections between them exceeds
a certain threshold. According to (Buttazzo, 2001), this threshold is equal to the
number of neurons in the human brain and the number of connections between them.
Implementing such systems requires a large amount of memory. Based on Moore's law,
the author claims that this amount of memory can be available within the next 20
years. Other similar estimates can be found in (Kurzweil, 2000; Paul et al, 1997;
Moravec, 2000).

3.2. Common sense knowledge 

The common sense Knowledge Base (KB) problem deals with the facts that a human
knows. The question is: how can we capture, store, and use all these facts (Waltz,
2006)? This problem was originally posed by Marvin Minsky in 1992 (Minsky, 1992)
in the context of slow progress in Natural Language Processing (NLP). He stated
that computers do not have access to the meanings of words and objects as humans
do. Take a "ROPE" as an example: someone can pull something with it, but cannot
push it. He can [...] something with it, but he cannot eat something with it. Even
a child can describe more than a hundred uses of a rope, or of any other object or
word, in a few minutes. A computer cannot do that. A human-like NLP system must
have access to such a KB, but no such KB exists.

This problem is not specific to NLP; it also exists in other areas. For example,
in machine vision, a human does not need to see an object with exactly four legs
and one back in order to recognize it as a chair. A human can recognize any usual
or unusual chair based on its shape, its functionality, or its relations to the
other objects in the scene. A human-like machine vision system must have access to
the shapes and functionalities of all objects and their relations to other
objects, which current systems lack.

Creating a common sense knowledge base is very difficult, because (McCarthy, 2007;
Dreyfus et al, 1988):

1. A large amount of information must be captured.

2. There is no proper knowledge representation method.

3. Updating the facts in the KB is very difficult.

4. There is no efficient method for using and reasoning with that knowledge.



4. How to crowd compute?

As mentioned before, there are three main scenarios for crowd computing: multiple
checking, GWAP, and iterative tasks. The procedures of the GWAP and iterative task
scenarios depend heavily on the application. In contrast, the procedures of the
multiple checking scenario can be formalized and optimized.

In this section, we formally define the problem of crowd computing using the
multiple checking scenario. Then, we describe the main approaches for solving it
in detail.

4.1. Problem definition 

Consider a passive crowd computing system for solving classification problems that
uses a multiple checking scenario. In this system, as we will see later, the
probability of estimating the true label of a sample is strongly related to the
number of labelers who label that sample. Hence, the quality of the solution is
strongly related to the user costs. Since the highest quality for a specified
budget is desired, this is a constrained optimization problem.

Consider a binary classification problem⁴ with samples $X = \{x_i\}_{i=1}^{N}$ and
their unknown true labels⁵ $Y = \{y_i\}_{i=1}^{N}$, $y_i \in \{-1, +1\}$. The users
of the system are denoted by $U = \{u_j\}_{j=1}^{R}$. The labels provided by users
for samples are denoted by $A = \{A_{ij}\}_{i=1,j=1}^{N,R}$,
$A \in \{0, -1, +1\}^{N \times R}$, where $A_{ij} = 0$ means that user $u_j$ did
not provide any label for sample $x_i$. The total number of labels collected for
all samples (the budget) is limited, i.e.
$\sum_{i=1}^{N} \sum_{j=1}^{R} |A_{ij}| = B$.

The goal is to estimate the gold standards (finding the $\hat{y}_i$s) so as to
maximize $P(\hat{y}_i = y_i \mid x_i, A)$.

4 The resulting algorithms and equations for binary classification problems can
be extended to general classification problems. In addition, techniques have
been proposed to transform some non-classification problems into classification
ones in crowdsourcing systems. For example, in (Bernstein et al, 2010; Little et
al, 2010), a non-classification problem is given to the users, and their
solutions are acquired. Then, all or part of the solutions are considered as the
set of possible labels (classes) for the original problem. This classification
problem can then be solved using crowd computing. Also, in (Frank et al, 2001) a
method is proposed to convert scoring or ordinal regression problems to
classification ones. And (Janssens, 2010) describes how to solve the problem of
sorting images based on their content using binary classification methods.

5 True labels are also named 'gold standards' or 'objective ground truths'.

The basic strategy is to divide the budget equally among all samples, and to
acquire approximately the same number of labels for every problem. Also, the
basic method for integrating the collected labels and estimating the
$\hat{y}_i$s is the majority vote:

$$P(\hat{y}_i = +1 \mid x_i, A) = P(\hat{y}_i = +1 \mid A_i) = \frac{r_i^{+}}{r_i} \tag{1}$$

where $A_i$ denotes the $i$th row of the matrix $A$, and $r_i^{+}$ and $r_i$ are
the number of +1s and the number of non-zero elements of $A_i$, respectively.
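As a minimal sketch of the majority vote of Eq. (1), the following Python function integrates one row of the label matrix. The function name and the handling of empty rows and ties (returning no label) are assumptions made for illustration; the text does not specify them.

```python
def majority_vote(row):
    """Integrate one row A_i of the label matrix (entries in {-1, 0, +1}).

    Returns (P(y_i = +1 | A_i), estimated label). The label is None when
    the row holds no labels or the vote is tied (an illustrative choice).
    """
    votes = [a for a in row if a != 0]  # 0 means "no label from this user"
    if not votes:
        return 0.5, None
    # Number of +1 labels divided by the number of non-zero entries.
    p_pos = sum(1 for a in votes if a == +1) / len(votes)
    if p_pos == 0.5:
        return p_pos, None
    return p_pos, (+1 if p_pos > 0.5 else -1)
```

For example, a row with three +1 labels, one -1 label and one missing label gives a probability of 0.75 for class +1.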

In the majority vote, as the number of acquired labels grows, both the cost and
the quality of the integrated labels increase, whereas higher quality at a fixed
cost is desired. How can the quality of solutions be increased while the cost
stays fixed? There are two main approaches: 1) using inductive methods, and 2)
planned and purposeful budget spending.

In the first approach, high quality labels are extracted for some samples.
Then, a classifier is learned using these samples and their estimated labels.
Finally, the labels of the other samples are estimated using this classifier.

The second approach considers user expertise and the properties of the problems
(e.g., type, difficulty level, ...). In each step, a label is requested for the
sample for which a new label would lead to the maximum increase in the overall
quality. Finding such a sample is a hard decision making problem. We name this
problem "sample selection". Sample selection can utilize the history of users'
activities. The user histories can be stored as statistical models. We name the
process of specifying a model for user activities and finding its parameters
"user modeling". Sample selection, user modeling and label integration are the
three main components of the second approach.

In the following, we describe the inductive approach and the three components
of the second approach in more detail.



4.2. Inductive approach 

In the inductive approach, a generative model classifies new samples. By using 
the active learning approach this model can be empowered by new samples. The 
labels of both training set and active learning samples are obtained using crowd 
computing. In the active learning approach, each new sample either is classified 
by the model, or is used to improve the model. A decision making problem arises 
here. 

There are several criteria for making the decision about a new sample. One
criterion is the uncertainty of the label that the classifier assigns to the new
sample (Sheng et al, 2008). The probability that the assigned label is
inaccurate can be taken as the label's uncertainty. Another criterion is the
expected information gain (Paquet et al, 2010). This criterion assumes that the
classifier is a parametric model. In each step, the model's parameter, $\theta$,
is estimated twice: once using the current training set ($D$), and once using
the current training set augmented with the new sample and its ground truth
($D + \{x, y\}$). The information gained by adding the new sample (i.e., the
entropy difference between the two models) can be measured by the
Kullback-Leibler divergence. The size of this difference shows how suitable the
new sample is for improving the model. Since the sample's ground truth ($y$) is
not known, its expected value, which is calculated using the classifier, is
used. Then, the expected information gain is (Paquet et al, 2010):

$$\Delta(\{x, y\}) = E_{p(y \mid x, D)}\, E_{p(\theta \mid D)} \log \frac{p(\theta \mid D)}{p(\theta \mid D + \{x, y\})} \tag{2}$$



The inductive methods require that objects be described as feature vectors.
Hence, a transformation from the pattern space to the feature space is required.
As we mentioned in the "why?" section, we are not interested in this approach,
and we will focus on the other approach.



4.3. Sample selection 

Suppose that there is a specified limited budget for labeling a set of samples,
and the labeling cost is the same for all samples (uniform labeling cost). In
addition, suppose that the labeling process is adaptive, i.e., in each step, one
label is requested for one of the samples based on the labels collected so far.
A rational procedure selects the sample for which a new label maximizes the
overall quality.

There are several criteria for sample selection. The simplest criterion is
selecting the sample that currently has the fewest labels. This criterion
results in approximately the same number of labels for all samples. We call this
criterion "uniform".

As simpler samples need fewer labels, in most cases the uniform criterion wastes
the budget. Sample selection based on the heterogeneity of the current labels,
and based on the uncertainty of their integration, are instances of non-uniform
criteria.

The heterogeneity criterion selects the sample whose current labels have the
maximum heterogeneity. The heterogeneity of a set of labels can be measured by
their entropy (Sheng et al, 2008; Ipeirotis et al, 2010).

Entropy is a proper measure of heterogeneity in this problem, but it has an
undesirable bias. To illustrate this bias, suppose that the ratio of the
dominant class in the current labels of sample $x_i$ is $p_i$. The entropy
criterion selects the sample $x_l$ where,

$$l = \arg\max_i \{-p_i \lg(p_i) - (1 - p_i) \lg(1 - p_i)\} \tag{3}$$
$$\phantom{l} = \arg\min_i \{p_i\} \tag{4}$$

This means that entropy always selects the sample whose dominant label ratio is
closest to 0.5. It implicitly indicates that entropy is biased in some cases
toward selecting the samples that already have more labels than the others. For
example, entropy never selects a sample with only one label in its current label
set. Experimental results show that using entropy results in a few samples with
many labels, and a lot of samples with few labels.

In general, entropy does not consider the number of labels. In addition, all
heterogeneity-based criteria lead to poor results in the case of noisy labelers,
because heterogeneity criteria do not consider the labeler expertise (they keep
requesting labels for the same samples and receive equally noisy labels).

The uncertainty criterion uses the inaccuracy probability of the estimated
labels. Consider sample $x_i$ with $L_1 + L_2$ acquired labels, $L_1$ of which
indicate class +1 and $L_2$ of which indicate class -1. The likelihood
$P(L_1, L_2 \mid y_i)$ is binomial. Assuming equal prior probabilities for the
two classes, the posterior probability $P(y_i \mid L_1, L_2)$ has a $\beta$
distribution^6.

6 The uniform distribution is a special case of the $\beta$ distribution, and
the $\beta$ distribution is the conjugate of the binomial distribution.

Since the cumulative distribution function of the $\beta$ distribution is the
regularized incomplete beta function ($I_x$), if
$I_{0.5}(L_1 + 1, L_2 + 1) < 0.5$ (i.e., most of the posterior mass lies above
0.5) the integrated label is +1, otherwise it is -1. Also, the uncertainty is
given by (Sheng et al, 2008):

$$U_i = \min\{I_{0.5}(L_1 + 1, L_2 + 1),\ 1 - I_{0.5}(L_1 + 1, L_2 + 1)\} \tag{5}$$

The criterion proposed in (Sheng et al, 2008) does not consider user models.
However, it can be extended to use any assumed statistical model in the problem
formulation. In general, if the currently collected labels are stored in $A$,
and the assumed models are specified by the parameter set $\theta$, then the
uncertainty of sample $x_i$ based on the current data is:

$$U_i = \min_{c \in \{-1, +1\}} \{P(y_i = c \mid x_i, A, \theta)\} \tag{6}$$

Conclusion: The uniform criterion is very simple, but it wastes the budget and
thus is not efficient. The entropy criterion is simple and intuitive, but it is
biased. The uncertainty criterion is more complex than the other criteria, but
it can use any assumed statistical model. It can also handle noisy labelers,
provided that proper user models are assumed.

Since little attention has been paid to adaptive methods in the literature, only
a few methods have been proposed for sample selection. Moreover, the uncertainty
criterion has only been used with restricted statistical user models in the
context of sample selection.

As a guideline for future work on this topic, we mention four properties of an
efficient sample selection criterion:

1. Considering the currently acquired labels. An efficient criterion must
consider the number and quality of the labels acquired so far for each sample.
The uncertainty criterion satisfies this property.

2. Estimating the future. An efficient criterion should estimate the change in
the overall performance after acquiring a new label for its selected sample. The
criterion must select the sample that leads to the maximum expected improvement
in the overall performance. None of the proposed criteria estimate the future.

3. Avoiding local optima. An efficient sample selection criterion is not greedy.
It considers the overall performance, not the maximum improvement in the
performance of the current step. All of the presented criteria are greedy.

4. Avoiding bias toward selecting improper samples. Except for entropy, all of
the presented criteria are unbiased. Any new efficient sample selection
criterion should likewise not be biased toward selecting samples with undesired
features.

There are also some related open questions which have not yet been addressed by
any researchers. The role of exploration alongside exploitation, and the role of
deterministic versus proportionally random sample selection, are examples of
these questions.
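To make the difference between the entropy criterion (Eq. 3) and the uncertainty criterion (Eq. 5) concrete, the following sketch computes both scores from the label counts of a sample. It is an illustration under stated assumptions, not the implementation of (Sheng et al, 2008): the function names are hypothetical, the counts are assumed to be integers so that $I_{0.5}$ can be evaluated with an exact binomial sum, and the label rule assumes that $I_{0.5}(L_1 + 1, L_2 + 1)$ is the posterior mass below 0.5, so that a value below 0.5 favors class +1.

```python
from math import comb, log2

def entropy(n_pos, n_neg):
    """Entropy (in bits) of a label set with n_pos +1s and n_neg -1s."""
    n = n_pos + n_neg
    h = 0.0
    for count in (n_pos, n_neg):
        if count:
            q = count / n
            h -= q * log2(q)
    return h

def beta_cdf_half(a, b):
    """I_0.5(a, b) for positive integers a, b, via the identity
    I_x(a, b) = sum_{j=a}^{a+b-1} C(a+b-1, j) x^j (1-x)^(a+b-1-j)."""
    n = a + b - 1
    return sum(comb(n, j) for j in range(a, n + 1)) / 2 ** n

def uncertainty(n_pos, n_neg):
    """Uncertainty of Eq. (5) with L1 = n_pos and L2 = n_neg."""
    mass_below_half = beta_cdf_half(n_pos + 1, n_neg + 1)
    return min(mass_below_half, 1.0 - mass_below_half)

def integrated_label(n_pos, n_neg):
    """+1 when most of the Beta posterior mass lies above 0.5 (assumed rule)."""
    return +1 if beta_cdf_half(n_pos + 1, n_neg + 1) < 0.5 else -1
```

With 2 positive and 1 negative labels, and with 20 positive and 10 negative labels, the entropy is identical because the label ratio is the same, whereas the uncertainty drops as more labels are collected. This is one way to see that entropy ignores the number of labels while the uncertainty criterion does not.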
4.4. User modeling

User modeling covers two types of modeling: user behavior modeling and user
expertise modeling. User behavior modeling is relevant only to passive systems,
while user expertise modeling plays a critical role in both active and passive
systems.

The goal of user behavior modeling is to discover the factors that users
consider when selecting problems from the pool in passive systems. Finding these
factors helps system designers to devise policies for achieving optimum
performance. The policies are used in designing sub-problems, including
determining the type of problems, the rewards, the time at which they enter the
pool, and how long they persist in the pool. These factors can be found using
user behavior assessment tests (Yang et al, 2008; Mason et al, 2010; Zhu et al,
2010).

The goal of user expertise modeling is to create parametric statistical models
of user performance, using the users' histories in the system.

From one point of view, the various models differ in the parameters they use and
in the properties of those parameters. Accuracy, sensitivity and specificity,
and reliability are examples of model parameters. In addition, each parameter
can have different properties. For example, accuracy can be modeled in different
ways, such as: one accuracy parameter for all users, one accuracy parameter per
user, or different accuracy parameters for each user for different categories of
problems. Moreover, each of those can be stationary in time, or time varying.

From another point of view, there are different scenarios for calculating the
model parameters. The first scenario is modeling based on a dedicated training
set, and using the obtained models in both the sample selection and label
estimation phases. The second scenario is user modeling after the sample
selection phase. In this approach, all labels are first acquired from the users.
Then, using these labels, the user models and the integrated labels are
estimated simultaneously. This scenario is named "one-shot". No training data is
required in the one-shot scenario, but the user models cannot be used in the
sample selection phase. The last scenario adaptively updates the model
parameters. In each step, a part of the labels is acquired based on the current
labels and the estimated models. Then, the user models are updated. In this
scenario, the user models updated in each step are used in the sample selection
phase of the next step. Most current research uses the one-shot scenario.

Due to limitations, and noting that most user behavior modeling methods are
heuristic, we do not examine the methods proposed in this area. In the rest of
this section, we survey the various methods proposed for user expertise
modeling. Note that the term "user model" in the following parts refers to "user
expertise model".

Here, we survey the various types of accuracy, sensitivity and specificity,
reliability, and expertise models.

4.4.1. Accuracy modeling

The simplest type of user expertise modeling is "uniform accuracy modeling". It
uses one accuracy parameter for all users.

If we denote the collected labels for problem $x_i$ by $z_1, \ldots, z_{2L+1}$,
uniform accuracy modeling assumes that
$P(z_j = y_i \mid x_i) = P(z_j = y_i) = p$. According to the binomial
distribution, we have:

$$q = P(\hat{y}_i = y_i) = \sum_{k=0}^{L} \binom{2L+1}{k} p^{2L+1-k} (1 - p)^{k} \tag{7}$$

where $k$ is the number of potentially incorrect answers, $\hat{y}_i$ is the
label estimated by majority voting, and $q$ is the probability that more than
$L$ labelers propose correct labels.

Fig. 2. The quality of estimated labels, based on the number and the quality of
collected labels.

According to Eq. (7), $q$ is bigger than $p$ iff $p > 0.5$. Also, if $p > 0.5$,
$q$ increases as $L$ increases, while the rate of increase diminishes. The rate
of change depends on both $p$ and $L$. For example, increasing the number of
labelers leads to more significant improvements when $p = 0.7$ than when
$p = 0.9$ (Sheng et al, 2008). Fig. 2 shows the value of $q$ as a function of
the number of collected labels per sample, for different values of $p$.
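The behavior described above can be checked numerically with a short sketch of Eq. (7). The function name is hypothetical, and the code assumes independent labelers that all share the same accuracy $p$, as in uniform accuracy modeling.

```python
from math import comb

def majority_quality(p, L):
    """Probability q that a majority vote over 2L+1 labelers, each correct
    with probability p, yields the true label (Eq. 7)."""
    n = 2 * L + 1
    # k counts the incorrect answers; the vote is correct when k <= L.
    return sum(comb(n, k) * p ** (n - k) * (1 - p) ** k for k in range(L + 1))
```

Going from one labeler to three raises the quality from 0.7 to 0.784 for $p = 0.7$, but only from 0.9 to 0.972 for $p = 0.9$, and for $p < 0.5$ the vote is worse than a single labeler.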

In (Eagle, 2009), a method is proposed for non-uniform user accuracy modeling,
which uses a separate accuracy parameter for each user. Since this method is
quite similar to the earlier method of (Dawid et al, 1979) for user sensitivity
and specificity modeling, we only present the latter method here.



4.4.2. Accuracy modeling using interval estimation (IEThresh method)

IEThresh was proposed for assigning problems to users in active systems (Donmez
et al, 2009). In statistics, given $T_j$ observations of variable $u_j$ from
distribution $d$, the interval estimation method estimates an interval to which
the next observation belongs with probability $1 - \alpha$.

IEThresh estimates accuracy intervals for all users, which describe the
correctness probabilities of the next answers provided by the users. IEThresh
selects the user with the highest upper bound of the accuracy interval. A higher
upper bound indicates higher expected accuracy (when the interval is short) or
higher uncertainty (when the interval is long). Thus, IEThresh considers both
exploitation and exploration. The interval's upper bound is estimated as:

$$U(u_j) = \mu(u_j) + t_{\alpha/2}^{(T_j - 1)} \frac{\sigma(u_j)}{\sqrt{T_j}} \tag{8}$$

where $\mu(u_j)$ and $\sigma(u_j)$ are the mean and standard deviation of the
correct answers provided by user $u_j$ (the correct answers are estimated by
comparing the proposed labels to the majority votes), and
$t_{\alpha/2}^{(T_j - 1)}$ is the critical value of Student's t-distribution
with $T_j - 1$ degrees of freedom at level $\alpha/2$.

Fig. 3. SFilter's hidden Markov model (the shaded variables are observed).

Given enough time (i.e., after observing many user answers), IEThresh leads to
very good results in active systems, even if the number of high quality users is
small.
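The upper bound of Eq. (8) can be sketched as follows. This is only an illustration: the function names are hypothetical, the sample standard deviation is assumed for $\sigma(u_j)$, each user needs at least two recorded answers, and the t critical values are supplied by the caller from a standard t-table (2.776 for 4 degrees of freedom and 2.262 for 9 degrees of freedom at $\alpha/2 = 0.025$) rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

def upper_interval(rewards, t_crit):
    """Upper bound of Eq. (8). rewards holds 1 for an answer that agreed
    with the majority vote and 0 otherwise; t_crit is the t critical value
    for len(rewards) - 1 degrees of freedom."""
    return mean(rewards) + t_crit * stdev(rewards) / sqrt(len(rewards))

def pick_user(history, t_table):
    """Select the user with the highest upper bound.
    history maps user -> reward list; t_table maps degrees of freedom ->
    critical value (both are hypothetical containers)."""
    return max(
        history,
        key=lambda u: upper_interval(history[u], t_table[len(history[u]) - 1]),
    )

history = {
    "A": [1, 1, 0, 1, 1],                     # mean 0.8, few observations
    "B": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],      # mean 0.9, more observations
}
t_table = {4: 2.776, 9: 2.262}
chosen = pick_user(history, t_table)
```

In this example user A is chosen although user B has the higher observed accuracy, because the interval of A is wider. This is the exploration effect described above.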

4.4.3. Time varying accuracy modeling

SFilter assumes that the user accuracies are not stationary in time, and
proposes a time varying algorithm for user accuracy modeling (Donmez et al,
2010). SFilter is proposed for filtering out low-quality users in active
systems. The algorithm uses Sequential Bayesian Estimation. It also assumes that
the maximum rate of change is small and known (Donmez et al, 2010).

Suppose that $p_j^t$ represents the accuracy of user $u_j$ at time $t$, and
$z_j^t$ is the label provided by the user at time $t$ ($A_{ij} = z_j^t$ is the
label provided by the user for problem $x_i$ at time $t$). The goal is to
estimate $P(p_j^t \mid z_j^1, \ldots, z_j^t)$, the posterior probability of the
user accuracy.

SFilter considers the following Markov model for the accuracy changes, which is
shown in Fig. 3:

$$p_j^t = f_t(p_j^{t-1}, \Delta_{t-1}) = p_j^{t-1} + \Delta_{t-1}, \qquad \Delta \sim N(0, \sigma^2) \tag{9}$$

This Markov model states that the accuracy of each user at time $t$ depends only
on its accuracy at time $t - 1$, and that the label proposed by the user at each
time depends only on the user's accuracy at that time.

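Using this Markov model, and considering that the user accuracies are values in
the range (0.5, 1], SFilter calculates the transition probability from
$p_j^{t-1}$ to $p_j^t$ using a truncated Gaussian distribution:

$$P(p_j^t \mid p_j^{t-1}, \sigma) = \frac{\frac{1}{\sigma}\, \phi\!\left(\frac{p_j^t - p_j^{t-1}}{\sigma}\right)}{\Phi\!\left(\frac{1 - p_j^{t-1}}{\sigma}\right) - \Phi\!\left(\frac{0.5 - p_j^{t-1}}{\sigma}\right)} \tag{10}$$

where $\phi$ is the standard Gaussian probability density function, and $\Phi$
is its cumulative distribution function.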

Given a problem's true label and the user's accuracy at time $t$, the label
provided by the user is modeled as:

$$P(z_j^t \mid p_j^t, y_i) = (p_j^t)^{I(z_j^t = y_i)} (1 - p_j^t)^{I(z_j^t \neq y_i)} \tag{11}$$

where $I$ is the indicator function. In practice, the value of $y_i$ in Eq. (11)
is unknown. Suppose that $z_j^t$ is the label provided by user $u_j$ at time $t$
for problem $x_i$, and $z_{J(t)}^t$ is the set of labels provided by the other
users for that problem. We have:

$$P(z_j^t \mid p_j^t, z_{J(t)}^t) = \sum_{y \in \{-1, +1\}} P(z_j^t \mid p_j^t, y_i = y)\, P(y_i = y \mid z_{J(t)}^t) \tag{12}$$

where $P(y_i = y \mid z_{J(t)}^t)$ is calculated as the probability of the
integrated label from the labels $z_{J(t)}^t$:

$$P(y_i \mid z_{J(t)}^t) \propto P(y_i)\, P(z_{J(t)}^t \mid y_i) = P(y_i) \prod_{s \in J(t)} P(z_s^t \mid y_i) \tag{13}$$

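SFilter considers $P(p_j^t \mid z_j^1, \ldots, z_j^{t-1})$ as the prior
probability of $P(p_j^t \mid z_j^1, \ldots, z_j^t)$. Using the
Chapman-Kolmogorov equation, we have,

$$P(p_j^t \mid z_j^{1:t}) = \frac{P(z_j^{1:t} \mid p_j^t)\, P(p_j^t)}{P(z_j^{1:t})} \tag{14}$$

$$= \frac{P(z_j^t \mid p_j^t, z_j^{1:t-1})\, P(z_j^{1:t-1})\, P(p_j^t \mid z_j^{1:t-1})}{P(z_j^t \mid z_j^{1:t-1})\, P(z_j^{1:t-1})} \tag{15}$$

$$= \frac{P(z_j^t \mid p_j^t)\, P(p_j^t \mid z_j^{1:t-1})}{P(z_j^t \mid z_j^{1:t-1})} \tag{16}$$

where $z_j^{1:t}$ denotes $z_j^1, \ldots, z_j^t$, and
$P(p_j^t \mid z_j^{1:t-1}) = \int_{p_j^{t-1}} P(p_j^t \mid p_j^{t-1})\, P(p_j^{t-1} \mid z_j^{1:t-1})\, dp_j^{t-1}$.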

All model parameters must be updated after receiving each new label. Since this
is very time consuming, an incremental algorithm is also proposed in (Donmez et
al, 2010). In the incremental version of SFilter, a discrete approximation of
the posterior is estimated using sequential particle filtering.

Experiments show that if the changes in user accuracies follow the assumed
model, SFilter can track the changes over time (Donmez et al, 2010).
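The predict and update recursion behind this kind of sequential Bayesian estimation can be illustrated with a simplified grid-based filter. This is not the SFilter algorithm itself: it replaces particle filtering with a fixed grid over (0.5, 1], it assumes that the correctness of each label is observed directly (SFilter instead marginalizes over the unknown true label using the other users' labels), and the function name, grid size and value of sigma are hypothetical choices.

```python
from math import exp

def grid_filter(correct_flags, sigma=0.05, n_bins=50):
    """Track the posterior over one user's accuracy on a grid in (0.5, 1].

    correct_flags: sequence of booleans, True if the label at that time
    step was correct (assumed known here for simplicity).
    Returns (grid, final belief, posterior mean after each step).
    """
    grid = [0.5 + 0.5 * (b + 1) / n_bins for b in range(n_bins)]
    belief = [1.0 / n_bins] * n_bins  # uniform prior over the accuracy

    def transition_row(prev):
        # Gaussian random walk truncated to the grid range; normalizing
        # over the grid plays the role of the truncation constant.
        w = [exp(-0.5 * ((p - prev) / sigma) ** 2) for p in grid]
        total = sum(w)
        return [x / total for x in w]

    rows = [transition_row(prev) for prev in grid]
    means = []
    for flag in correct_flags:
        # Predict: push the belief through the transition model.
        predicted = [
            sum(belief[a] * rows[a][b] for a in range(n_bins))
            for b in range(n_bins)
        ]
        # Update: Bernoulli likelihood of a correct or incorrect label.
        post = [(p if flag else 1.0 - p) * q for p, q in zip(grid, predicted)]
        z = sum(post)
        belief = [x / z for x in post]
        means.append(sum(p * b for p, b in zip(grid, belief)))
    return grid, belief, means

# A user who answers correctly 30 times and then incorrectly 30 times.
grid, belief, means = grid_filter([True] * 30 + [False] * 30)
```

The posterior mean rises during the run of correct labels and falls again during the run of incorrect ones, which is the tracking behavior described above.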

4.4.4. Sensitivity and specificity modeling

Other parameters for user expertise modeling in binary labeling problems are
sensitivity and specificity. Sensitivity indicates the proportion of actual
positive samples (samples which belong to class +1) that are correctly
recognized. Similarly, specificity indicates the proportion of actual negative
samples (samples which belong to class -1) that are correctly recognized. The
generalization of these parameters to multi-class problems is the set of
independent elements of the confusion matrix.

In 1979, Dawid and Skene proposed expertise modeling using the confusion matrix
for multi-class medical diagnosis tests (Dawid et al, 1979). They calculated the
likelihood of the true answers for all samples, and then maximized it using the
Expectation-Maximization (EM) method.

Consider a multi-class classification problem with samples
$\{x_i\}_{i=1}^{N}$ and their gold standards $\{y_i\}_{i=1}^{N}$. Each $y_i$
belongs to one of the classes $\{C_l\}_{l=1}^{J}$. The class prior probabilities
are $\{P(C_l)\}_{l=1}^{J}$. Also, assume that each of the $R$ users in the
system provides none, one, or multiple labels for each of the samples, which are
stored in matrix $A$.

First, we assume that the gold standards are known, and then generalize the
results to the case where the gold standards are unknown. Suppose that
$n_{il}^{k}$ is the number of labels $C_l$ assigned to problem $x_i$ by user
$u_k$. $T_{iq}$ is one if label $C_q$ is the true label of problem $x_i$, and
zero otherwise. Also, the $\pi_{jl}^{k}$s are the elements of the confusion
matrix of user $u_k$ ($\forall j, k:\ \sum_{l=1}^{J} \pi_{jl}^{k} = 1$). The
likelihood of extracting the true labels for all samples is:

$$l = P(A \mid \Pi, P) = \prod_{i=1}^{N} \prod_{j=1}^{J} \left\{ P(C_j) \prod_{k=1}^{R} \prod_{l=1}^{J} \left(\pi_{jl}^{k}\right)^{n_{il}^{k}} \right\}^{T_{ij}} \tag{17}$$

where $\Pi$ is the set of the elements of all user confusion matrices, $P$ is
the set of all prior probabilities, and
$\prod_{l=1}^{J} (\pi_{jl}^{k})^{n_{il}^{k}}$ is the multinomial distribution.
Maximizing the likelihood leads to the following estimates of the parameters:

$$\hat{\pi}_{jl}^{k} = \frac{\sum_{i=1}^{N} T_{ij}\, n_{il}^{k}}{\sum_{l'=1}^{J} \sum_{i=1}^{N} T_{ij}\, n_{il'}^{k}}, \qquad \hat{P}(C_j) = \frac{\sum_{i=1}^{N} T_{ij}}{N} \tag{18}$$

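In the case of unknown gold standards, the likelihood is:

$$l' = P(A \mid \Pi, P) = \prod_{i=1}^{N} \left\{ \sum_{j=1}^{J} P(C_j) \prod_{k=1}^{R} \prod_{l=1}^{J} \left(\pi_{jl}^{k}\right)^{n_{il}^{k}} \right\} \tag{19}$$

Maximizing this function is complicated, so the following EM algorithm is used
to estimate the parameters:

— Initialization: $\forall j, k, l,\ l \neq j$: $P(C_j) = \frac{1}{J}$,
$\pi_{jj}^{k} = 1$, $\pi_{jl}^{k} = 0$.

— E Step: Extracting the sample labels, using the user confusion matrices and
prior probabilities of the previous step. $C_j$ is the label of $x_i$ with the
following probability:

$$P(T_{ij} = 1 \mid A, \Pi, P) \propto P(A \mid T_{ij} = 1, \Pi, P)\, P(T_{ij} = 1) \tag{20}$$

$$\propto \prod_{k=1}^{R} \prod_{l=1}^{J} \left(\pi_{jl}^{k}\right)^{n_{il}^{k}} P(C_j) \tag{21}$$

$$= \frac{\prod_{k=1}^{R} \prod_{l=1}^{J} \left(\pi_{jl}^{k}\right)^{n_{il}^{k}} P(C_j)}{\sum_{q=1}^{J} \prod_{k=1}^{R} \prod_{l=1}^{J} \left(\pi_{ql}^{k}\right)^{n_{il}^{k}} P(C_q)} \tag{22}$$

— M Step: Updating the user confusion matrices and the prior probabilities,
using the labels extracted in the previous step and Eq. (18).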
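The EM iteration above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation of (Dawid et al, 1979): the function name and the triple-based input format are hypothetical, the soft labels are initialized from vote fractions rather than from the identity confusion matrices listed above, every item is assumed to have at least one label, and a small constant eps is added in the M step to avoid zero probabilities.

```python
def dawid_skene(labels, n_items, n_users, n_classes, n_iter=50, eps=1e-9):
    """EM estimation of soft labels T, confusion matrices pi and priors.

    labels: list of (item i, user k, class l) triples with 0-based indices.
    Returns (T, pi, prior) with T[i][j] ~ P(T_ij = 1 | A) and
    pi[k][j][l] ~ P(user k answers l | true class j).
    """
    def normalize(row):
        total = sum(row)
        return [v / total for v in row]

    # Initialization (assumption): soft labels from vote fractions.
    T = [[0.0] * n_classes for _ in range(n_items)]
    for i, _, l in labels:
        T[i][l] += 1.0
    T = [normalize(row) for row in T]

    for _ in range(n_iter):
        # M step: Eq. (18) with the soft labels in place of T_ij.
        prior = [sum(T[i][j] for i in range(n_items)) / n_items
                 for j in range(n_classes)]
        pi = [[[eps] * n_classes for _ in range(n_classes)]
              for _ in range(n_users)]
        for i, k, l in labels:
            for j in range(n_classes):
                pi[k][j][l] += T[i][j]
        pi = [[normalize(row) for row in user] for user in pi]

        # E step: Eq. (22), prior times the product of matching pi entries.
        T = [list(prior) for _ in range(n_items)]
        for i, k, l in labels:
            for j in range(n_classes):
                T[i][j] *= pi[k][j][l]
        T = [normalize(row) for row in T]
    return T, pi, prior

# Toy data: users 0 and 1 always answer correctly, user 2 always flips.
truth = [0, 0, 0, 1, 1, 1]
labels = []
for i, y in enumerate(truth):
    labels += [(i, 0, y), (i, 1, y), (i, 2, 1 - y)]
T, pi, prior = dawid_skene(labels, n_items=6, n_users=3, n_classes=2)
```

On this toy data the estimated labels match the truth and the confusion matrix of user 2 concentrates on the off-diagonal entries, which identifies that user as systematically wrong. For long label lists a log-domain E step would be preferable to avoid underflow.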

A Bayesian approach has also been proposed for sensitivity and specificity
modeling. This method assumes the following prior probabilities (Raykar et al,
2010):

$$P(\alpha^j \mid a_1^j, a_2^j) = \mathrm{Beta}(\alpha^j \mid a_1^j, a_2^j) \tag{23}$$

$$P(\beta^j \mid b_1^j, b_2^j) = \mathrm{Beta}(\beta^j \mid b_1^j, b_2^j) \tag{24}$$

$$P(p^+ \mid p_1, p_2) = \mathrm{Beta}(p^+ \mid p_1, p_2) \tag{25}$$

where $\alpha^j$ and $\beta^j$ are the sensitivity and specificity of user
$u_j$, $p^+$ is the prior probability of class +1 (i.e.
$P(y_i = +1 \mid x_i) = P(y_i = +1) = p^+$), $a_1^j$ and $a_2^j$ are the numbers
of true and false answers provided by user $u_j$ to the problems of class +1,
$b_1^j$ and $b_2^j$ are the numbers of true and false answers provided by user
$u_j$ to the problems of class -1, $p_1$ and $p_2$ are the total numbers of
labels provided by all users to all problems of classes +1 and -1, and
$\mathrm{Beta}$ is the beta probability density function.

Similar to the previous method, the values of the variables required in the E
and M steps can be derived as follows (Raykar et al, 2010):

$$\mu_i = P(y_i = +1 \mid x_i, A, \alpha, \beta, p^+) = \frac{p^+ a_i}{p^+ a_i + (1 - p^+)\, b_i} \tag{26}$$

$$\alpha^j = \frac{a_1^j - 1 + \sum_{i=1}^{N} \mu_i\, I(A_{ij} = +1)}{a_1^j + a_2^j - 2 + \sum_{i=1}^{N} \mu_i} \tag{27}$$

$$\beta^j = \frac{b_1^j - 1 + \sum_{i=1}^{N} (1 - \mu_i)\, I(A_{ij} = -1)}{b_1^j + b_2^j - 2 + \sum_{i=1}^{N} (1 - \mu_i)} \tag{28}$$

$$p^+ = \frac{p_1 - 1 + \sum_{i=1}^{N} \mu_i}{p_1 + p_2 - 2 + N} \tag{29}$$

where,

$$a_i = P(A_{i1}, \ldots, A_{iR} \mid y_i = +1, \alpha) = \prod_{j=1}^{R} [\alpha^j]^{I(A_{ij} = +1)} [1 - \alpha^j]^{I(A_{ij} = -1)} \tag{30}$$

$$b_i = P(A_{i1}, \ldots, A_{iR} \mid y_i = -1, \beta) = \prod_{j=1}^{R} [\beta^j]^{I(A_{ij} = -1)} [1 - \beta^j]^{I(A_{ij} = +1)} \tag{31}$$

4.4.5. Belief propagation based reliability modeling

A user reliability modeling and label integration algorithm is proposed in
(Karger et al, 2011; Karger et al, 2011). The authors utilize an iterative
'belief propagation'-like algorithm for this purpose. The procedure is shown in
Alg. 1. In this algorithm, $E$ is the set of edges of the bipartite graph
specified by the adjacency matrix $A$, $\partial i$ denotes all neighbors of $i$
in the graph, and $S \setminus k$ excludes $k$ from the set $S$.



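Algorithm 1 Belief propagation-inspired algorithm

1: $\forall (i, j) \in E$, initialize $p_{j \to i}^{(0)}$ with random $Z_{ij} \sim N(1, 1)$

2: for $k = 1, \ldots, k_{max}$ do

3:   $\forall (i, j) \in E$, $s_{i \to j}^{(k)} \leftarrow \sum_{j' \in \partial i \setminus j} A_{ij'}\, p_{j' \to i}^{(k-1)}$

4:   $\forall (i, j) \in E$, $p_{j \to i}^{(k)} \leftarrow \sum_{i' \in \partial j \setminus i} A_{i'j}\, s_{i' \to j}^{(k)}$

5: $\forall i$, $s_i \leftarrow \sum_{j \in \partial i} A_{ij}\, p_{j \to i}^{(k_{max})}$

6: $\forall i$, $\hat{y}_i \leftarrow \mathrm{sign}(s_i)$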
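The message passing of Alg. 1 can be sketched as follows. This is an illustrative reading of the algorithm, not the authors' code: the function name is hypothetical, the label matrix is a list of rows with entries in {-1, 0, +1}, the random initialization uses a fixed seed for reproducibility, and the messages are not rescaled, so they grow in magnitude with the number of iterations.

```python
import random

def kos_message_passing(A, k_max=10, seed=0):
    """Estimate task labels from the label matrix A (tasks x users)."""
    rng = random.Random(seed)
    n_tasks, n_users = len(A), len(A[0])
    edges = [(i, j) for i in range(n_tasks) for j in range(n_users)
             if A[i][j] != 0]
    # User-to-task messages, initialized with N(1, 1) samples.
    p = {e: rng.gauss(1.0, 1.0) for e in edges}
    for _ in range(k_max):
        # Task-to-user messages: sum over the task's users, excluding the
        # destination user (total minus the destination's own term).
        task_total = [0.0] * n_tasks
        for (i, j) in edges:
            task_total[i] += A[i][j] * p[(i, j)]
        s = {(i, j): task_total[i] - A[i][j] * p[(i, j)] for (i, j) in edges}
        # User-to-task messages: sum over the user's tasks, excluding the
        # destination task.
        user_total = [0.0] * n_users
        for (i, j) in edges:
            user_total[j] += A[i][j] * s[(i, j)]
        p = {(i, j): user_total[j] - A[i][j] * s[(i, j)] for (i, j) in edges}
    # Final decision: weighted sum of the labels of each task.
    score = [0.0] * n_tasks
    for (i, j) in edges:
        score[i] += A[i][j] * p[(i, j)]
    return [1 if x >= 0 else -1 for x in score]

# Toy data: seven users always correct, one user always wrong.
truth = [1, 1, -1, 1, -1, -1, 1, -1]
reliability = [1] * 7 + [-1]
A = [[y * r for r in reliability] for y in truth]
estimated = kos_message_passing(A)
```

In the toy example the user-to-task messages of the unreliable user become negative, so that user's labels are effectively flipped before they are summed.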



Liu et al. proposed a graphical model with a belief propagation inference
algorithm for crowd computing in (Liu et al, 2012). They showed that the
presented algorithm is a belief propagation based algorithm if a Haldane prior
(Zellner, 1971) is taken as the prior distribution of user reliabilities.

One of the most important features of the presented algorithm is its relation to
low-rank matrices and Singular Value Decomposition (SVD), when $A$ corresponds
to an $(l, r)$-regular bipartite graph with $l = r$ (Karger et al, 2011; Karger
et al, 2011).

Power iteration is a method for computing the leading singular vectors of a
matrix. For a matrix $A_{m \times n}$ and two vectors $u \in \mathbb{R}^m$ and
$v \in \mathbb{R}^n$, starting with a randomly initialized $v$, power iteration
iteratively updates $u$ and $v$ according to:

$$\forall i,\ u_i = \sum_{j} A_{ij} v_j, \qquad \forall j,\ v_j = \sum_{i} A_{ij} u_i \tag{32}$$

It is known that the randomly initialized $u$ and $v$ converge linearly to the
leading left and right singular vectors.

These update rules are very similar to those of Alg. 1. Thus, the following
algorithm can be used to estimate the answers to the questions:

1. Compute the left and right singular vectors of $A$ ($u$ and $v$)
corresponding to the top singular value of $A$.

2. Since both $(u, v)$ and $(-u, -v)$ are valid pairs of leading singular
vectors, the mass of the element values is used to resolve the ambiguity: if
$\sum_{j: v_j \geq 0} v_j^2 \geq \sum_{j: v_j < 0} v_j^2$ then
$\hat{y}_i = \mathrm{sign}(u_i)$, otherwise $\hat{y}_i = \mathrm{sign}(-u_i)$.
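The two steps above can be sketched with a plain power iteration. The code is an illustration under stated assumptions: the function names are hypothetical, the vectors are renormalized after every update (the update rule in the text omits normalization), the start vector is drawn with a fixed seed, and squared entries are used as the "mass" when resolving the sign ambiguity.

```python
import random
from math import sqrt

def leading_singular_vectors(A, n_iter=100, seed=0):
    """Power iteration for the leading left/right singular vectors of A."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    u = [0.0] * m
    for _ in range(n_iter):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm = sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return u, v

def svd_labels(A):
    """Estimate task labels from the leading singular vectors of A."""
    u, v = leading_singular_vectors(A)
    # Resolve the (u, v) versus (-u, -v) ambiguity: most of the user mass
    # is assumed to belong to reliable users, i.e. to positive entries.
    pos = sum(x * x for x in v if x >= 0)
    neg = sum(x * x for x in v if x < 0)
    sign = 1 if pos >= neg else -1
    return [1 if sign * x >= 0 else -1 for x in u]

# Toy data: seven users always correct, one user always wrong.
truth = [1, 1, -1, 1, -1, -1, 1, -1]
reliability = [1] * 7 + [-1]
A = [[y * r for r in reliability] for y in truth]
estimated = svd_labels(A)
```

In this toy example the right singular vector is proportional to the user reliabilities, so the sign correction recovers the true labels whichever of the two valid vector pairs the iteration converges to.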

Note that the rules of Alg. 1 and the power iteration rules are not exactly the
same. In the update rules of the algorithm, the signals received from the
destination are excluded (the '$\setminus j$'s in the algorithm), whereas these
signals are included in the power iteration. In other words, the power iteration
rules are a simplification of the algorithm's rules, because power iteration
approximates all the different $s_{i \to j}$ with a common $u_i$.

Some intuitions are given in (Karger et al, 2011) to justify why the top left
singular vector of $A$ reveals the estimated labels.

Moreover, the obtained leading singular vectors can be used to construct a
low-rank approximation of the matrix.

Given a matrix $A$ of rank $r$, the goal of low-rank matrix approximation is to
approximate $A$ with a matrix $A'$ of rank at most $k$ ($k < r$), such that:

$$A' = \arg\min_{Z \mid \mathrm{rank}(Z) = k} \|A - Z\|_F \tag{33}$$

According to the Eckart & Young theorem (Eckart et al, 1936) we have:

$$\min_{Z \mid \mathrm{rank}(Z) = k} \|A - Z\|_F = \|A - A_k\|_F \tag{34}$$

where $A_k$ is computed as $U \Sigma_k V^T$, $U$ and $V$ contain the left and
right singular vectors of $A$, $\Sigma$ is a diagonal matrix containing the
sorted singular values of $A$ ($A = U \Sigma V^T$), and $\Sigma_k$ is formed by
setting the $r - k$ smallest singular values on the diagonal of $\Sigma$ to
zero. Thus, as mentioned above, the second algorithm uses the rank-1
approximation of $A$.

Fig. 4. GLAD's hidden Markov model.

4.4.6. User expertise modeling, considering the difficulty level of problems

None of the above models consider the difficulty level of problems, although
there is some evidence that considering the difficulty level of problems in user
modeling is useful for some applications^7.

The "GLAD" algorithm is proposed to model both the users' expertise and the
difficulty level of problems (Whitehill et al, 2009). GLAD can also find the
noisy and adversarial labelers, and when there are many adversarial labels, it
can utilize them to produce higher quality results.

GLAD models the difficulty of problem $x_i$ using the parameter
$1/\beta_i \in [0, \infty)$, where a higher value of the parameter indicates a
higher difficulty level. Also, GLAD models the expertise of labeler $u_j$ using
the parameter $\alpha_j \in (-\infty, +\infty)$, where a higher value of
$\alpha_j$ indicates a higher level of expertise, and $\alpha_j < 0$ indicates
an adversarial labeler.

The graphical schema of the model's dependencies is shown in Fig. 4 (the shaded
variables are observed, and the others are latent).

GLAD uses the following logistic model:

$$P(A_{ij} = y_i \mid \alpha_j, \beta_i) = \sigma(\alpha_j \beta_i) = \frac{1}{1 + e^{-\alpha_j \beta_i}} \tag{35}$$

7 For example, the experiments on five categories of problems in (Wais et al,
2010) show that using the same users for all problem categories leads to
different results. Or, in some categories the majority vote is better than the
answers of the best user in the system, while in some others the opposite is
true. As further evidence, in some problems the diversity of user answers is
very high, while in some others it is low (Brew et al, 2010).

GLAD uses an EM algorithm to estimate the latent variables. For the E step, the
posterior probability of $y_i$ given $A$, $\alpha$ and $\beta$ is:

$$P(y_i \mid A, \alpha, \beta) = P(y_i \mid A_i, \alpha, \beta_i) \tag{36}$$

$$= \frac{P(y_i \mid \alpha, \beta_i)\, P(A_i \mid y_i, \alpha, \beta_i)}{P(A_i \mid \alpha, \beta_i)} \tag{37}$$

$$\propto P(y_i \mid \alpha, \beta_i)\, P(A_i \mid y_i, \alpha, \beta_i) \tag{38}$$

$$\propto P(y_i) \prod_{j} P(A_{ij} \mid y_i, \alpha_j, \beta_i) \tag{39}$$

where,

$$P(A_{ij} \mid y_i = +1, \alpha_j, \beta_i) = \sigma(\alpha_j \beta_i)^{I(A_{ij} = +1)} (1 - \sigma(\alpha_j \beta_i))^{I(A_{ij} = -1)} \tag{40}$$

$$P(A_{ij} \mid y_i = -1, \alpha_j, \beta_i) = \sigma(\alpha_j \beta_i)^{I(A_{ij} = -1)} (1 - \sigma(\alpha_j \beta_i))^{I(A_{ij} = +1)} \tag{41}$$
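The E step of Eqs. (36)-(41) reduces to a product of logistic terms for each class, which the following sketch evaluates for one problem. The function names are hypothetical, the expertise and difficulty parameters are assumed to be given (in GLAD they come from the previous M step), and 0 is used to mark a missing label.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def glad_posterior(A_i, alpha, beta_i, prior_pos=0.5):
    """P(y_i = +1 | A_i, alpha, beta_i) for one problem.

    A_i: labels of the problem, one per user, in {-1, 0, +1}.
    alpha: expertise parameter of each user; beta_i: inverse difficulty.
    """
    w_pos, w_neg = prior_pos, 1.0 - prior_pos
    for a_ij, alpha_j in zip(A_i, alpha):
        if a_ij == 0:
            continue  # this user gave no label for the problem
        s = sigmoid(alpha_j * beta_i)  # probability of a correct label
        w_pos *= s if a_ij == +1 else 1.0 - s  # Eq. (40)
        w_neg *= s if a_ij == -1 else 1.0 - s  # Eq. (41)
    return w_pos / (w_pos + w_neg)

# Two competent users say +1; an adversarial user (alpha < 0) says -1.
posterior = glad_posterior([+1, +1, -1], alpha=[2.0, 2.0, -3.0], beta_i=1.0)
```

Here the -1 label of the adversarial user counts as evidence for class +1, so the posterior is close to one. As the problem becomes very hard ($\beta_i$ near zero) every label becomes uninformative and the posterior falls back to the prior.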

In the M step, to calculate the best values of the parameters, we maximize the
expectation of the joint log-likelihood of the observed and hidden variables
given the parameters. The objective function is:

$$Q(\alpha, \beta) = E[\ln P(A, Y \mid \alpha, \beta)] \tag{42}$$

$$= \sum_{i} E[\ln P(y_i)] + \sum_{ij} E[\ln P(A_{ij} \mid y_i, \alpha_j, \beta_i)] \tag{43}$$

$$= \sum_{i} \{p^+ \ln P(y_i = +1) + p^- \ln P(y_i = -1)\} \tag{44}$$

$$+ \sum_{ij \mid A_{ij} = +1} \{p^+ \ln \sigma(\alpha_j \beta_i) + p^- \ln(1 - \sigma(\alpha_j \beta_i))\} \tag{45}$$

$$+ \sum_{ij \mid A_{ij} = -1} \{p^- \ln \sigma(\alpha_j \beta_i) + p^+ \ln(1 - \sigma(\alpha_j \beta_i))\} \tag{46}$$

where $p^+$ and $p^-$ are the probabilities of the two classes for sample $x_i$
from the previous step (calculated using $\alpha^{old}$ and $\beta^{old}$).

Setting the gradients of $Q$ to zero results in non-linear equations. Thus, the
maximization needs to be solved using iterative methods.

GLAD is the first model that simultaneously estimates the true labels, the
difficulty of the problems, and the users' expertise. However, like the previous
methods, it does not consider the sample selection and label acquisition phase.
Recently, some other methods inspired by GLAD, such as (Welinder et al, 2010),
have been proposed for modeling the difficulty levels of problems.

Conclusion: Uniform accuracy is the simplest parameter for user expertise
modeling. However, since it uses one accuracy parameter for all users, it is not
sufficiently accurate. Non-uniform accuracy modeling uses a separate parameter
for each user. SFilter supposes that the user accuracies are time varying.
However, it uses a simple Markov model for these changes. Also, it assumes that
the maximum rate of change is small, known, and the same for all user
accuracies. The last method that utilizes accuracy as its model parameter is
IEThresh. This method models each user by an accuracy interval. IEThresh is
useful in active systems for task assignment. In each step, it assigns the
selected problem to the user with the highest interval upper bound. It leads to
good results, even in the presence of a large number of noisy users.

The confusion matrix is another parameter set for user modeling. This parameter
set includes more detailed information about the user expertise. It is useful in
problems where users perform differently on different classes of problems.

The GLAD method considers the difficulty level of problems, which might be
useful in some applications. However, it leads to a nonlinear optimization
problem. Therefore, it is complicated, and is difficult to use in adaptive
methods.

The non-uniform accuracy, confusion matrix based, and GLAD methods optimize the
likelihood of getting true answers for all samples in order to calculate the
model parameters. There are two main problems with this approach. The first
problem is the assumption of independence between the labels provided by
different users. Since the answers provided by some users are correlated, this
assumption is not true in all cases. The other problem is specific to methods
that use the EM algorithm to estimate the optimum values of the parameters. EM
is sensitive to the initialization values, and it might not converge to an
acceptable set of values.

The belief propagation based algorithm uses reliability as the parameter of
the user models. Unlike the EM-based methods, this algorithm is not sensitive
to the initial settings. An SVD-based version of this iterative algorithm has
also been proposed. Although this version relates the method to the SVD and
low-rank matrices, which have a rich theory behind them, the equivalence only
holds when the assignment graph of questions to labelers is an (l, r)-regular
bipartite graph with l = r. This condition does not usually hold in practice.
Another weakness of this method is the simplifications made in converting the
original iterative algorithm to a power iteration-like algorithm. Finally, it
is only suitable for binary classification problems.

4.5. Labels integration 

The simplest strategy for integrating the collected labels and estimating the
sample's label is majority voting. However, majority voting has a drawback.
Consider five labels acquired from users with accuracies 0.55, 0.85, 0.75,
0.6, and 0.8. The correctness probabilities of the majority vote, the
majority of the best three labels, and the best label are 0.86, 0.90, and
0.85, respectively. This example illustrates that filtering out the two worst
labels leads to the best result. Hence, in some cases it is recommended to
filter out some of the acquired labels. In general, majority voting does not
consider the quality of responses.
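The three probabilities in the example can be checked with a short
computation. The sketch below assumes that the users answer independently and
that the number of labels is odd, so ties cannot occur. The function name
`prob_majority_correct` is hypothetical.

```python
def prob_majority_correct(accuracies):
    """Probability that a strict majority of independent binary labels is correct.

    dist[k] holds the probability that exactly k of the labels seen so far
    are correct (a Poisson binomial distribution built one user at a time).
    """
    dist = [1.0]
    for p in accuracies:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1 - p)   # this user is wrong
            new[k + 1] += q * p     # this user is correct
        dist = new
    needed = len(accuracies) // 2 + 1
    return sum(dist[needed:])


accuracies = [0.55, 0.85, 0.75, 0.6, 0.8]
best_three = sorted(accuracies, reverse=True)[:3]

print(round(prob_majority_correct(accuracies), 2))      # 0.86 (all five labels)
print(round(prob_majority_correct(best_three), 2))      # 0.9  (best three labels)
print(round(prob_majority_correct(best_three[:1]), 2))  # 0.85 (best label only)
```

The unrounded values are 0.8591, 0.8975, and 0.85, so dropping the two least
accurate users raises the correctness probability by almost four percentage
points.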

Probabilistic methods and the belief propagation-inspired algorithm consider
the quality of responses through user models. They are more efficient than
majority voting. In these methods, the way the collected labels are
integrated depends on the type of user model. We discussed the suitable
integration method for each type of user model in the previous section.
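As one simple illustration of integration through a user model, the sketch
below combines binary labels using a single accuracy value per user. It
assumes that the answers are independent given the true label and that every
accuracy lies strictly between 0 and 1. The function name
`posterior_positive` and the toy data are hypothetical, and the sketch does
not reproduce any specific surveyed method.

```python
import math


def posterior_positive(labels, accuracies, prior=0.5):
    """Posterior probability that the true label is 1.

    labels[j] is the 0/1 answer of user j and accuracies[j] is that user's
    assumed probability of answering correctly. Each answer shifts the
    log-odds by log(acc / (1 - acc)) towards the class the user chose.
    """
    log_odds = math.log(prior / (1 - prior))
    for label, acc in zip(labels, accuracies):
        weight = math.log(acc / (1 - acc))
        log_odds += weight if label == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))


labels = [1, 1, 1, 0, 0]
accuracies = [0.55, 0.6, 0.55, 0.85, 0.8]

majority = int(sum(labels) > len(labels) / 2)
weighted = int(posterior_positive(labels, accuracies) > 0.5)
print(majority, weighted)  # 1 0
```

Here three weak users outvote two strong ones under majority voting, while
the accuracy-weighted estimate sides with the two strong users.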

4.6. Experimental results 

The validity of the experimental results reported in some related papers is
open to question. In those papers, the authors did not have access to real
datasets. Therefore, they produced their own synthetic data according to
their assumed models, and then showed that their methods can efficiently find
the user models and the sample labels. Clearly, these results may not be
valid. In some other papers, the authors used fair assumptions in producing
the synthetic data, but they did not consider the costs and produced a large
number of labels for each sample. The experimental results of these papers
are valid theoretically, but they may not be useful in practice.

            #Samples  #Labelers  #Labels  #Labels/Sample
rteDS            800        164     8000              10
tempDS           462         76     4620              10
DuchenneDS       159         17     1950            8-15

Table 1. The properties of the datasets that are used in experimental tests.

Fig. 5. The histograms of user accuracies (top row), and the histograms of
responses' accuracies (bottom row) for utilized datasets.

To compare the surveyed methods, we implemented all of them and applied them
to three real datasets. These datasets consist of binary classification
problems whose labels were acquired from MTurk users. In each question of the
recognizing textual entailment dataset (rteDS), two sentences are presented
to the users, who make a binary choice of whether the second (hypothesis)
sentence can be inferred from the first one (Snow et al, 2008). In each
question of the temporal event recognition dataset (tempDS), the users must
choose one of the two labels "strictly before" or "strictly after" to
represent the temporal relation between a pair of events (Snow et al, 2008).
For each face image of the Duchenne dataset, the users are asked to determine
whether the face contains a Duchenne ("enjoyment") smile or a "social" smile
(Whitehill et al, 2009). Table 1 contains the properties of these datasets.

The histograms of user accuracies and of the collected labels' accuracies are
shown in Fig. 5 for all datasets.

The implementations show that, for higher budgets, all sample selection
methods produce almost the same results. When the average number of labels
per sample is small, however, the uncertainty criterion leads to better
results than the other criteria (uniform and heterogeneity).
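To make the idea of an uncertainty criterion concrete, the sketch below
requests the next label for the sample whose current labels are least
decisive. The scoring rule (the smaller side of a Laplace-smoothed positive
fraction) is an assumption chosen for brevity and is not necessarily the
definition used by the surveyed methods. The names `uncertainty` and
`next_sample` are hypothetical.

```python
def uncertainty(labels):
    """Uncertainty of a sample's majority label, between 0 and 0.5.

    Uses the Laplace-smoothed fraction of positive labels, so a sample with
    few labels is more uncertain than one with many agreeing labels.
    """
    p = (sum(labels) + 1) / (len(labels) + 2)
    return min(p, 1 - p)


def next_sample(collected):
    """Return the id of the sample whose current labels are most uncertain."""
    return max(collected, key=lambda sample: uncertainty(collected[sample]))


# Hypothetical 0/1 labels collected so far for three samples.
collected = {
    "s1": [1, 1, 1, 1],  # strong agreement
    "s2": [1, 0, 1],     # disagreement
    "s3": [0, 0],        # agreement, but only two labels
}
print(next_sample(collected))  # s2
```

Under a small budget, such a rule spends new labels on disputed samples
instead of spreading them uniformly over all samples.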

The comparison of sample selection criteria for tempDS is shown in Fig. 6.

Fig. 6. Comparison of sample selection criteria for temp dataset.

Fig. 7. Comparison of user modeling methods for rte dataset.



The other datasets lead to similar results. All reported results are the
average of 20 runs.

In the user modeling tests, on all datasets, majority voting (equal models
for all users) leads to the worst results, while GLAD leads to the best
results. The comparison of user modeling methods for rteDS is shown in Fig. 7.

In Fig. 8, we compare majority voting (uniform sample selection, equal models
for all users), uncertainty sample selection (equal models for all users),
and user accuracy modeling (uniform sample selection) with the combination of
uncertainty sample selection and user accuracy modeling. Note that the last
method calculates sample uncertainties from the correctness probabilities of
the estimated labels, which are obtained from the user accuracy models. The
results show that utilizing both sample selection and user modeling leads to
worse results than utilizing either of them alone.

8 The source code and all comparison result diagrams are available at the
first author's web page at http://ce.sharif.edu/~muhammadi

Fig. 8. Comparison of using sample selection and user modeling with and
without each other for temp dataset.

Fig. 9. Comparison of the best sample selection method, the best user
modeling method, and the best user responses for Duchenne dataset.

Finally, we compared the best sample selection method (uncertainty) and the
best user modeling method (GLAD) with the best user responses. For the
latter, we assumed that the actual user accuracies are known, and for each
sample we selected the label of the best user as the estimated label. The
results for the Duchenne dataset are shown in Fig. 9.

All obtained results for the rte, temp, and Duchenne datasets are shown in
Tables 2, 3, and 4.

                  1 lps   3 lps   5 lps   7 lps   9 lps
Majority voting   26.84   19.84   15.63   12.94   10.36
Entropy           26.84   23.48   20.93   16.60   12.22
Uncertainty       26.84   16.29   11.91   10.03   10.74
Accuracy          26.84   16.63   11.76   10.73   10.56
Acc. & Uncert.    29.03   15.32   14.49   13.45   11.43
Sen., Spe.        26.84   15.27   11.22   10.26    9.74
Reliability       26.84   19.69   12.29   10.25    8.62
GLAD              26.84   14.77   10.69    8.91    7.76

Table 2. The errors (%) of all methods for rte dataset. Each column denotes a
specified number of average labels per sample (lps).





                  1 lps   3 lps   5 lps   7 lps   9 lps
Majority voting   26.37   16.46   10.94    7.44    6.35
Entropy           26.37   21.37   17.13   12.84    8.39
Uncertainty       26.37   12.78    7.85    6.68    6.26
Accuracy          26.37   10.49    8.25    7.34    6.94
Acc. & Uncert.    26.83   10.31    9.37    9.42    8.02
Sen., Spe.        26.37   10.90    7.91    7.21    6.94
Reliability       26.37   11.98    8.27    7.12    6.30
GLAD              26.37   11.14    7.76    6.33    6.06

Table 3. The errors (%) of all methods for temp dataset. Each column denotes
a specified number of average labels per sample.



5. Conclusion 

Crowd computing is a new field in computer science that combines the
strengths of both humans and computers to create systems that have never
existed before. In this paper, we surveyed crowd computing from three
aspects: what, why, and how.

First, in the "what" part, we introduced the fundamental concepts and the
various types of crowd computing systems. We then explained the properties of
suitable problems and applications for crowd computing, and illustrated the
performance of some existing systems. Finally, we described the steps
required to design a crowd computing system.

In the "why" part, we introduced human consciousness and common-sense
knowledge as what current computer systems lack in comparison to humans.
Then, we discussed their roles in creating more intelligent systems using
crowd computing.

In the "how" part, we presented a survey on solving classification problems
using crowd computing. We divided the past works into three sections: sample
selection, user modeling, and labels integration. We also compared the
surveyed methods and discussed their strengths and weaknesses. Moreover, we
proposed some topics as guidelines for future works.

                  1 lps   3 lps   5 lps   7 lps
Majority voting   37.74   33.14   31.76   30.19
Entropy           37.74   35.28   33.21   30.82
Uncertainty       37.74   31.38   29.59   28.21
Accuracy          37.74   31.38   28.49   27.89
Acc. & Uncert.    46.48   33.74   32.61   35.03
Sen., Spe.        37.74   28.77   28.11   28.40
Reliability       37.74   33.46   27.45   26.32
GLAD              37.74   29.72   26.35   25.31

Table 4. The errors (%) of all methods for Duchenne dataset. Each column
denotes a specified number of average labels per sample.

Overall Guidelines
- Focusing more on budget and costs.
- Calculating realistic error bounds for results.
- Using sparse representation methods.
- Handling multi-class and non-classification problems.

Sample Selection
- Utilizing user models in developing sample selection criteria.
- Considering the overall performance in each step.
- Considering the estimation of the changes in the overall performance,
  after getting a new label for a sample.
- Research on the role of factors such as exploration alongside
  exploitation, or deterministic vs. proportionally random sample selection.

User Modeling
- Proposing realistic time varying user models.
- Relaxing the independency assumption between collected labels in
  likelihood-based methods.

Labels Integration
- Detecting and filtering out low-quality labels.

Table 5. Guidelines for future works based on the current open issues.

Considering the technical open issues, we point to the following topics:
1) Little research has been done on adaptive scenarios, and this field
requires more attention. Current sample selection criteria are not effective
in the presence of user models. Hence, the efficient combination of sample
selection criteria and user modeling methods is a good topic for further
research. 2) With a large budget, almost all methods work well. The problem
arises when the budget is low (i.e. the user performances are low, or the
number of labels is small). Hence, future research must focus more on the
cost models. 3) Finally, calculating realistic error bounds for results is
another open issue for further research.

Table 5 summarizes all open issues for future research in this field.